Final Assignment: Data Analysis

1. Get started
2. Setup
3. The data
4. What you need to do
5. The program
- 5.1. Meets expectations
- 5.2. Exemplary
  - 5.2.1. (At least) two calculations must be complex
  - 5.2.2. Limited repeated code
6. The writeup
- 6.1. Style of writing
7. Submission schedule
8. The final submission
9. Citing your data
10. Possible data sources
- 10.1. US House of Representatives Election data
  - 10.1.1. More complex tasks you might consider
- 10.2. Yelp restaurant review data
  - 10.2.1. More complex tasks you might consider

This assignment is to be done individually. You can talk to other people in the class, me (Dave), the prefect, and lab assistants for ideas and to gain assistance. You can help each other debug programs, if you wish. The code that you write should be your own, however, and you shouldn’t hand a printout of your program to others. See the course syllabus for more details or just ask me if I can clarify.

1. Get started

Create a folder in which to store your work for this assignment.

If you are working on your own computer, it’s up to you where to put the folder. Your desktop is likely as good a place as any. Make a folder titled dataanalysis.
If you are working in the labs in Olin, make sure to first mount the COURSES folder, so that you won’t lose your code when you log out. Once you’ve done so, open up Finder, then navigate to your personal student work folder. You can then make a dataanalysis folder within there.
Once you’ve done so, you should then open up your new folder in VS Code. To do so, start up VS Code, then drag your folder onto the VS Code window. This should open up the folder within VS Code. If you are asked, click that you trust the authors.

2. Setup

One of the most common uses of programming is to be able to analyze data. Whether it is from surveys, machine collected data, or otherwise, nearly every field can learn from collected data. For this assignment, you’ll find some patterns in a dataset of your choice, and write about what you learn in an accompanying writeup.

3. The data

I will supply two different datasets, and you can choose one of them if you like. I’ll also pose some potential questions you might try to answer with them. But you can also find a dataset of your own if you wish. Getting the data into Python can be tricky (or not) depending on the actual data, so make sure you leave yourself time to see if you can do it before you get too attached to a particular source of data.
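
As a quick way to test whether a dataset will load, here is a minimal sketch that assumes the data is a CSV file with a header row. The column names (year, state, votes) and the sample text are made up for illustration; your data will likely look different.

```python
import csv
import io

# Hypothetical sample standing in for a real file; the columns are invented.
SAMPLE_TEXT = """year,state,votes
2018,MN,1200
2018,WI,950
2020,MN,1350
"""


def load_rows(file_obj):
    """Read CSV rows into a list of dictionaries, converting numeric columns."""
    rows = []
    for row in csv.DictReader(file_obj):
        # csv gives back strings, so numbers must be converted explicitly.
        row["year"] = int(row["year"])
        row["votes"] = int(row["votes"])
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_TEXT))
print(len(rows), "rows loaded; first row:", rows[0])
```

With a real file, you would pass an open file instead of the sample text, for example with open("mydata.csv", newline="") as f: rows = load_rows(f), where mydata.csv is a placeholder name.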

4. What you need to do

I am intentionally leaving this fairly open-ended: I’d encourage you to find something that interests you, and generate some results that you think are worthwhile or even perhaps surprising.

This assignment is really two assignments wrapped into one: the program that you’ll write to generate your results, and the writeup that you’ll put together describing what you did and what you found. Each of these will be graded separately, with more detail to be found below.

5. The program

5.1. Meets expectations

5.1.1. 3 different calculations

“M” work for this assignment means that you should produce at least 3 different results from your data. Averages, medians, and ranking are all fabulous options. There are many fancier things that you can do, and I’d encourage you to think through what you can do to make your results more interesting. The (at least) 3 different things that you measure need to be different kinds of measurements from each other, i.e. they should not be “average of blue data,” “average of green data,” etc.
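
To illustrate what three different kinds of measurement might look like, here is a small sketch on invented data. The records list and its values are hypothetical; the point is only that an average, a median, and a ranking are three distinct kinds of result.

```python
import statistics

# Hypothetical data: (name, score) pairs invented for illustration.
records = [("A", 4.0), ("B", 2.5), ("C", 5.0), ("D", 3.5), ("E", 1.0)]

scores = [score for _, score in records]

# Measurement 1: the average score.
average_score = sum(scores) / len(scores)

# Measurement 2: the median score.
median_score = statistics.median(scores)

# Measurement 3: a ranking of the names from highest to lowest score.
ranking = [name for name, _ in sorted(records, key=lambda r: r[1], reverse=True)]

print("Average:", average_score)
print("Median:", median_score)
print("Ranking:", ranking)
```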

5.1.2. A more complex calculation

For “M” work, at least one of your calculations should be more complex than a typical average, median, etc., of the rows in the data. More complex means “more complex coding needed.” It does not mean “more complex math needed.” In other words, a more complex calculation is one where

you would have to explain in at least a sentence or two what that measurement is doing or how you accomplished it,
it cannot be described via a direct mathematical formula (it likely requires the use of additional loops, variables, functions, or other such work), and
it is not just a minor tweak on a simpler measurement.

A common misconception that some students have had is that fancy statistical techniques are acceptable here, but they are not. (A standard deviation calculation may sound “complex” from a mathematics perspective, but it’s not really any harder to code than just taking an average, and the goal here is to demonstrate the coding skills you have learned in this course.) Use a comment in your code to indicate where this calculation is. You’ll find examples of these sorts of measurements at the end of this description.

5.1.3. Calculations clearly labeled

Finally, in order to receive a grade of “M”, you must comment clearly in your code where the “more complex” calculation can be found.

5.1.4. No user input

The program that you write should not be interactive; it should not prompt the user to answer any questions. You should put directly into your code any info you need for it to do the calculations you need it to do. Any results that you use in your writeup should be created as a result of running your program. (It’s ok to submit multiple programs, if need be.)
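
One possible way to avoid prompting the user is to put any settings at the top of the program as named values. The sketch below assumes a made-up filename and cutoff year; both are placeholders, not part of the assignment.

```python
# Settings are written directly in the code instead of being read with input().
# Both names and values here are hypothetical placeholders.
DATA_FILENAME = "mydata.csv"
MIN_YEAR = 2000


def keep_recent(years, min_year):
    """Return only the years at or after min_year."""
    return [year for year in years if year >= min_year]


recent = keep_recent([1996, 2000, 2004, 2008], MIN_YEAR)
print("Would analyze", DATA_FILENAME, "for years:", recent)
```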

5.2. Exemplary

5.2.1. (At least) two calculations must be complex

For “E” work, you must meet the characteristics for “M” work above, but at least two of your calculations should be in the “more complex” category, such as those described at the end of this assignment. They should be different from each other.

In order to receive an “E” grade, you must also comment clearly in your code where the “more complex” calculations can be found.

5.2.2. Limited repeated code

Your program should not have nearly identical code repeated multiple times. Find a way to manage this with a function if need be.
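
As a sketch of what this can look like, suppose you need the average of several columns. Rather than copying the same loop once per column, a single function can take the column name as a parameter. The rows and column names below are invented for illustration.

```python
def average_of_column(rows, column):
    """Return the average of one named column across all rows."""
    values = [row[column] for row in rows]
    return sum(values) / len(values)


# Hypothetical rows with two numeric columns.
rows = [
    {"height": 10, "weight": 4},
    {"height": 14, "weight": 6},
]

# The same function is reused instead of repeating nearly identical code.
average_height = average_of_column(rows, "height")
average_weight = average_of_column(rows, "weight")
print(average_height, average_weight)
```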

6. The writeup

“M” work for the writeup means that it should be at least 500 words, where it documents what you did. Specifically, summarize your approach, what the data represented, what kinds of calculations you did, and what the results are. Your work must include text addressing each of the above points. (Charts and graphs are awesome and you should include them if you can, but they don’t count towards the word count.)

“E” work means that in addition to the “M” work above, you have carefully made sure to interpret your results. In other words, you have not merely reported on what you did and what the outcomes were, but you have included text that analyzes the results you obtained and explains what each result means. Precisely how you do this will undoubtedly vary based on what you calculated, but the idea here is that you will have carefully thought about and explained why your results are important and/or surprising, or perhaps why they are entirely unsurprising. You don’t need more than a paragraph for each of your three calculations to do this. The key thing you need to do is to go beyond merely reporting the results you see; you should discuss their meaning.

6.1. Style of writing

When writing up work that you’ve done, the style in CS is that you shouldn’t describe what you did at the level of Python, since your program already does that. Rather, describe at a higher level indicating what you did conceptually. For example, go with text in a style like the below…

I calculated the flapdoodle by first reading in the data, and throwing away data that was irrelevant to calculating the flapdoodle. I then randomly shuffled the data, and drew the third value.

and not text like

I wrote a for loop, which looped over a range from 0 to the number of rows in the data, minus one. Within that for loop, I did an if to see if row==None, and if not, I added that value to a Python list named flapdoodles. I then used the random.shuffle method on that list. Finally, I wrote flapdoodles[2] to get the third value.

7. Submission schedule

The Moodle schedule indicates when the project itself is due, which is at the end of the final exam period. I CANNOT PERSONALLY GRANT AN EXTENSION BEYOND THE END OF FINAL EXAMS UNDER ANY CIRCUMSTANCES. The College has strict policies that forbid faculty from taking assignments any later than the end of final exams without an extension granted by the Dean of Students office. Here is more information about that process.

8. The final submission

The two portions of the assignment will be listed separately in Moodle; you should submit to each.

9. Citing your data

If you use data that I have not provided, remember to cite where you got it. CS as a field is generally not finicky about the exact format that you use for a citation, but it is very important to describe where your data came from so that you give appropriate credit to the people or organizations that generated it.

10. Possible data sources

Here are some options you can use, with some ideas on what you might do with them. You’ll find the data itself linked on the Moodle page.

10.1. US House of Representatives Election data

This file shows for a given year, state, and US congressional district the number of votes given to the candidates from each of the two major political parties.

10.1.1. More complex tasks you might consider

The data shows votes by both parties, by state and district, over time. Can you sum up the votes for each state, and see which states have changed their political alignment the most over time?
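
One rough way to approach this is sketched below. It assumes each row holds a year, state, district, and the votes for each of the two parties; that layout, the state names, and the numbers are all invented, and the actual file may be organized differently. The change in one party's share of the vote between a state's earliest and latest year is just one possible definition of a shift in alignment.

```python
from collections import defaultdict

# Hypothetical rows: (year, state, district, dem_votes, rep_votes).
rows = [
    (2000, "AA", 1, 100, 200),
    (2000, "AA", 2, 150, 150),
    (2020, "AA", 1, 220, 180),
    (2020, "AA", 2, 200, 100),
    (2000, "BB", 1, 300, 100),
    (2020, "BB", 1, 290, 110),
]


def state_shares(rows):
    """Map (state, year) to the Democratic share of the two-party vote."""
    totals = defaultdict(lambda: [0, 0])
    for year, state, _district, dem, rep in rows:
        totals[(state, year)][0] += dem
        totals[(state, year)][1] += rep
    return {key: dem / (dem + rep) for key, (dem, rep) in totals.items()}


def biggest_shifts(rows):
    """Rank states by how much their share changed from first to last year."""
    shares = state_shares(rows)
    years_by_state = defaultdict(list)
    for state, year in shares:
        years_by_state[state].append(year)
    shifts = []
    for state, years in years_by_state.items():
        change = shares[(state, max(years))] - shares[(state, min(years))]
        shifts.append((state, change))
    # Largest absolute change first, regardless of direction.
    return sorted(shifts, key=lambda pair: abs(pair[1]), reverse=True)


shifts = biggest_shifts(rows)
for state, change in shifts:
    print(state, round(change, 3))
```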

US Congressional districts are regularly formed by gerrymandering, which is a technique for drawing congressional districts so that a minority party can gain a majority (or at least a disproportionate number) of the congressional seats for the state. A few years ago, a measurement called the efficiency gap was proposed as a way of quantifying gerrymandering. You could measure the efficiency gap by state and by year, and produce results showing which states are the most heavily gerrymandered, and which political parties they favor.
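
As a rough sketch of the idea, the efficiency gap is commonly described in terms of wasted votes: every vote for a losing candidate is wasted, and so is every vote for the winner beyond what was needed to win. The gap is the difference between the two parties' wasted votes, divided by the total votes cast. The version below makes several simplifying assumptions that you should check against a proper description: it treats half of the district's two-party vote as the amount needed to win, it adds nothing for an exact tie, and it picks an arbitrary sign convention. The district numbers are invented.

```python
def efficiency_gap(districts):
    """Return (dem_wasted - rep_wasted) / total votes for one state and year.

    districts is a list of (dem_votes, rep_votes) pairs, one per district,
    and is assumed to contain at least one vote overall.
    """
    dem_wasted = 0.0
    rep_wasted = 0.0
    total_votes = 0
    for dem, rep in districts:
        district_total = dem + rep
        needed_to_win = district_total / 2  # simplifying assumption
        total_votes += district_total
        if dem > rep:
            dem_wasted += dem - needed_to_win  # surplus votes for the winner
            rep_wasted += rep                  # every losing vote is wasted
        elif rep > dem:
            rep_wasted += rep - needed_to_win
            dem_wasted += dem
        # An exact tie adds no wasted votes to either side (an assumption).
    return (dem_wasted - rep_wasted) / total_votes


# Hypothetical state with five districts of 100 voters each.
example_state = [(75, 25), (60, 40), (43, 57), (48, 52), (49, 51)]
gap = efficiency_gap(example_state)
print("Efficiency gap:", gap)
```

With this sign convention, a positive value means the first party wasted more votes, which suggests the map favors the second party. To get results by state and by year, you would first group the rows for each state and year, then call the function once per group.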

10.2. Yelp restaurant review data

This set of files, kindly provided by some folks at UC Berkeley, contains lots of restaurant reviews extracted from the Yelp Open Dataset. The original data is too sizeable for the scope of this project, so we’re working with the subset they produced, which is about restaurants near Berkeley, CA. But the techniques you’ll use would apply for restaurants from any location. This data consists of three files: one with data on restaurants, one with data on “users” who visit the restaurants, and one with the reviews that the users write about the restaurants. Each restaurant has a business id (in the restaurants file), and each user has a user id (in the users file). The reviews file contains a list of reviews and ratings by particular users for particular restaurants, as well as text reviews.

10.2.1. More complex tasks you might consider

Do users tend to review restaurants they like, or do they visit a variety? Does this vary by price of restaurant, or by typical length of review?

For each restaurant, you could look at the difference between the length of the longest review and the length of the shortest review that it has been given. You could then average those differences, within price category of the restaurant, to see if cheap restaurants receive more variation in review length than expensive restaurants.
For each user, you could measure the difference between the highest rating they have given to a restaurant, and the lowest rating they have given. That gives a variability score for each user. You could look to see if users who write long text reviews tend to give more or less varied ratings than those who write short reviews.
There are many variations on the above: be creative!
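
The two ideas above share a common step: group values by some id, then take the largest value minus the smallest. Here is a minimal sketch of that step on invented data. The ids, price categories, ratings, and review text are all hypothetical, and the real files will need to be read in and matched up by business id and user id first.

```python
from collections import defaultdict

# Hypothetical data; ids, prices, and reviews are invented for illustration.
restaurant_price = {"r1": 1, "r2": 1, "r3": 3}
reviews = [
    # (user_id, business_id, stars, text)
    ("u1", "r1", 5, "Great tacos."),
    ("u2", "r1", 2, "Slow service, and the salsa was bland."),
    ("u1", "r2", 4, "Solid."),
    ("u2", "r2", 3, "Fine, nothing special."),
    ("u1", "r3", 1, "Overpriced."),
    ("u3", "r3", 5, "Overpriced but a lovely room."),
]


def spread_by_key(pairs):
    """Map each key to (largest value - smallest value) among its values."""
    values = defaultdict(list)
    for key, value in pairs:
        values[key].append(value)
    return {key: max(vals) - min(vals) for key, vals in values.items()}


# Per restaurant: longest review length minus shortest review length.
length_spread = spread_by_key((biz, len(text)) for _, biz, _, text in reviews)

# Per user: highest rating given minus lowest rating given.
rating_spread = spread_by_key((user, stars) for user, _, stars, _ in reviews)

# Average the restaurant spreads within each price category.
spreads_by_price = defaultdict(list)
for biz, spread in length_spread.items():
    spreads_by_price[restaurant_price[biz]].append(spread)
average_spread_by_price = {
    price: sum(vals) / len(vals) for price, vals in spreads_by_price.items()
}

print("Average length spread by price:", average_spread_by_price)
print("Rating spread by user:", rating_spread)
```

Writing the grouping step once as a function and reusing it for both measurements also keeps the program free of nearly identical repeated code.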